ABSTRACT

A statistical model for investigating predictive 
validity at highly selective institutions is described. When the 
selection ratio is small, one must typically deal with a data set 
containing relatively large amounts of missing data on both criterion 
and predictor variables. Standard statistical approaches are based on 
the strong assumption that the missing data are missing at random 
(MAR) (i.e., the missing data can be accounted for in terms of the 
observed measures), and there are no unmeasured variables that 
underlie the missing data process. The proposed model represents an 
attempt to account for any unmeasured selection variables by assuming 
that applicants are first placed into admission categories by the 
institution and then selected within each category in terms of the 
observed predictor variables. Thus, although the MAR assumption may 
not hold for the set of all applicants, it may very well hold within 
each admission category. The model uses the EM algorithm to obtain 
estimates of validity separately within each category. The model is 
quite general and can be used when there are missing data on the 
predictor and criterion variables, and even if the admission category 
is not known for each applicant. The proposed model is illustrated in 
terms of a real life data set for a selective secondary school with 
over 2,000 applicants. Four tables present analysis data. 
(Author/SLD) 
Abstract

A statistical model for investigating predictive 
validity at highly selective institutions is described. 
When the selection ratio is small, one must typically deal 
with a data set containing relatively large amounts of 
missing data on both the criterion and predictor variables.
Standard statistical approaches are based on the strong 
assumption that the missing data are missing at random 
(MAR), i.e., the missing data can be accounted for in terms 
of the observed measures, and there are no unmeasured 
variables which underlie the missing data process. It is 
well known that violations in this assumption can yield 
biased estimates especially when there is a high proportion 
of missing data. The proposed model represents an attempt 
to account for any unmeasured selection variables by 
assuming that applicants are first placed into admission 
categories by the institution and then selected within each 
category in terms of the observed predictor variables. 
Thus, although the MAR assumption may not hold for the set 
of all applicants, it may very well hold within each 
admission category. The model uses the EM algorithm to 
obtain estimates of validity separately within each 
category. The model is quite general and can be used when
there are missing data on the predictor and criterion 
variables, and even if the admission category is not known 
for each applicant. The proposed model is illustrated in 
terms of a real life data set. 
INTRODUCTION



The problem of investigating the validity of a 
selection procedure is particularly difficult for highly 
selective institutions due to the relatively large amount of 
missing data that typically occurs in this setting. Our 
original interest in this problem arose because of the need 
to investigate the validity of a real life admissions 
program at an educational institution which admits fewer than
10 percent of its applicants. The basic problem is one of
investigating the statistical relationship of a set of
predictor variables ($x_1, x_2, \ldots, x_p$) to some criterion
variable y when complete data cannot be obtained. It should
be noted that in the most general case, data will be missing 
for both the y and x variables. The parameters to be 
estimated may include the squared multiple correlation in 
predicting y from the x's, the regression weights, various 
squared semi-partial correlations showing the "importance" 
of various x variables in predicting y, and the difference 
in the expected y score of individuals selected using the
x's and those who would have been selected by a lottery. 
This last parameter is especially relevant for highly 
selective institutions where it is often argued that the 
applicants are a self-selected group; thus a randomly chosen 
group of these applicants will perform at a comparable level 
to a group selected on the basis of test scores. 
How can the observed data be used to estimate the
parameters of interest? The answer to this question depends
upon what assumptions can be made concerning the selection
process which produced the missing data. Whereas these
assumptions may have a minor effect on the analysis when the
proportion of missing data is small, these assumptions must
be carefully considered when there are relatively large
amounts of missing data. The simplest (although often not
realistic) assumption is that the missing data can be
explained in terms of the observed predictor variables. For
example, suppose only subjects who score highest on $x_1$ (an
aptitude test) are observed on $x_2$ (a structured interview),
and only those subjects scoring highest on $x_2$ are observed
on y. In this example, $x_2$ is missing as a function of $x_1$,
and y is missing as a function of $x_2$. When the missing data
can be explained simply in terms of the observed predictor
variables, the missing data can be described as missing at
random (MAR) (Little & Rubin, 1987). Standard methods of
estimation can be applied given the MAR assumption. For
example, maximum likelihood estimates can be obtained by simply
considering the likelihood for the observed data. It should
be noted that the standard correlation corrected for
"restriction in range" is the maximum likelihood estimate
of the population correlation when there is a single x
variable and y is MAR (Cohen, 1955). Unfortunately, in many
cases the observed predictor variables alone will not
explain the missing data, implying that additional
unmeasured variables are operating. If these unaccounted
for variables are statistically related to the missing data,
given the observed data, the MAR assumption will not hold,
and the estimation procedure can become quite complex since
one must introduce a statistical model for the unmeasured
variables. Failure to account for the unmeasured variables
will in general lead to biased results (Heckman, 1976, 1979;
Linn, 1968; Olson & Becker, 1983). In our study of the real
life data set previously described, we were confronted with
just such a problem; a simple inspection of the applicant x
scores suggested that the missing data could not be 
accounted for simply in terms of the observed predictor 
variables. Many instances were found where non-accepted 
applicants had higher scores on the observed predictor 
variables than accepted applicants. Thus, the missing data 
could not be assumed to be MAR. However, a careful analysis
of the problem showed that the unmeasured selection 
variables could be accounted for by considering a set of 
admission categories used by the institution. Based on a 
set of measured "background variables," applicants were
first classified by the institution into one of three 
admission categories. The use of these categories basically 
represented a policy decision by the institution. For each 
category, selection was then based on the x variables. 
Thus, although the MAR assumption does not hold for the data 
set as a whole, the assumption is tenable within each 
category. Given this structure where applicants are first
grouped into categories and then selected in terms of the x 
variables, one can use the observed data to estimate the 
validity of the selection process separately within each 
admission category. This strategy was adopted as a model 
for investigating the validity of the admissions procedure. 
The proposed model is quite general and can be employed when
there are missing values on both the x and y variables, and 
in addition, even if the admission category membership is 
unknown for some applicants. 

In section 2, the general statistical model is
described together with the methods for estimating the 
validity of the predictor variables within each admission 
category. In section 3 the application of the model to the 
previously mentioned data set is described. Although we 
illustrate the usefulness of the proposed model in terms of 
a single real life data set, we believe that the model can 
be applied in validation studies at other highly selective 
institutions. The model is also not limited to educational 
settings, but could be used in industrial selection programs 
where different admission standards are used for different 
well defined applicant groups. In this third section we 
also deal with the problem of accounting for the missing y 
scores of admitted applicants who decline the admissions 
offer or who drop out prior to the measurement of y. For 
these cases, the missing data may be a function of 
unmeasured variables which are statistically related to y. 
These missing scores therefore cannot be assumed to be
MAR. We deal with this issue by performing a sensitivity
analysis where different possible values are imputed for the
missing scores, and the changes in the final analyses are
noted. The last section contains our conclusions and a
discussion of future research needs.

2. Statistical Model 

We assume that the applicants are grouped into a set of
g admission categories. Within each category, the p x
variables and the y variable are assumed to be multivariate
normal with mean vector $\mu_j$ and variance covariance matrix
$\Sigma_j$, j = 1, 2, ..., g. In the ideal case where there are no
missing data, one can choose between two different methods
for estimating the unknown parameters. First, if one
assumes that the $\Sigma_j$ matrices vary over the categories, the
data from each admission category would be separately
analyzed. For example, g separate multiple correlations
would be computed. Second, if one assumes homogeneity of
the $\Sigma_j$ matrices, a single pooled sample variance covariance
matrix is computed. In this case a single multiple
correlation would be obtained. Assuming reasonably large
sample sizes within each category, the choice between these
two options would be based on some common homogeneity of
variance-covariance test.
Consider the case of missing data. If category
membership is known for all applicants, but there are
missing data on the x and y variables (assumed to be MAR
within a category), the two options described above will be
available as long as there are enough complete cases within
each category to assure that the individual $\Sigma_j$ matrices can
be estimated. The choice between the two models (common $\Sigma$
versus separate $\Sigma_j$) can be made using an appropriate
statistical test. Due to the missing data, the commonly
used homogeneity tests cannot be applied. However, one
could test the null hypothesis of equal $\Sigma_j$ using a
log-likelihood ratio test (Mood, Graybill, & Boes, 1974).

Given that missing data are present, the general method
used to obtain estimates in both the homogeneous and
heterogeneous $\Sigma_j$ models is the so-called EM algorithm
(Little & Rubin, 1987). The procedure can be described in
general using a type of "what if" reasoning. If there were
complete data, one would compute the usual statistics for
estimating the parameters of interest. For example, one
would compute the mean of the x and y variables within each
category as well as the category variance covariance matrix
or the pooled variance covariance matrix. These statistics
would represent the maximum likelihood estimates. In the
presence of missing data, one estimates these statistics by
computing their expected values, given the observed data
and some initial set of parameter estimates. These expected
values are then used as a new set of parameter estimates,
and the expectation step is again applied. The procedure is
continued until convergence is reached. The final expected
values are the maximum likelihood estimates. We know of two
computer programs for performing this analysis: BMDP-AM
(1990) for the unequal $\Sigma_j$ case, and a FORTRAN program
obtained from M.D. Schlucter (personal communication, 1990)
for the equal $\Sigma_j$ case. It should be noted that the latter
program can also be used in the most complex case where data
are missing for category membership, as well as for the x
and y variables. In other words, not only are some subjects
missing scores on x and y, but the category membership may
not be known for all applicants. In this case, parameter
estimation is possible only if the common $\Sigma$ model is
assumed. The EM estimation theory for this case is more
complex and is described in detail in Little & Rubin (1987).

To further illustrate the estimation procedure,
consider the following simple data set consisting of g = 2
admission categories, two x variables ($x_1$, $x_2$) and y.
Suppose there are $n_1 = 6$ applicants in category 1 and $n_2 = 8$
applicants in category 2. The observed and missing data
(denoted as "?") are given as follows:
Category 1

    x1    x2    y
    5     ?     9
    3     4     7
    2     3     5
    4     5     ?
    ?     4     ?
    ?     3     7

Category 2

    x1    x2    y
    8     7     9
    8     9     (illegible)
    7     6     8
    5     5     ?
    4     6     ?
    3     2     ?
    3     ?     ?
    2     ?     ?

The missing data pattern in category 2 is said to be
monotonic or nested since there is an ordering of the
variables in terms of "observability." The $x_1$ variable is
most observed, followed by $x_2$, and then y, which is the least
observed. The data in category 1 do not exhibit a
monotonic pattern.
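
Whether a missing data pattern is monotonic can be checked mechanically. The following is a minimal Python sketch, not part of the original analysis; the function name `is_monotone` and the 1/0 coding of observed versus missing entries are assumptions made for illustration. It orders the variables from most to least observed and verifies that no case is observed on a later variable while missing an earlier one. The two patterns are the observed/missing indicators read from the example data above.

```python
def is_monotone(pattern):
    """Return True if the observed/missing pattern is monotonic (nested).

    `pattern` is a list of rows; each row holds 1 (observed) or 0 (missing)
    for every variable. This coding is an illustrative assumption.
    """
    n_vars = len(pattern[0])
    # Order the variables from most observed to least observed.
    counts = [sum(row[v] for row in pattern) for v in range(n_vars)]
    order = sorted(range(n_vars), key=lambda v: -counts[v])
    for row in pattern:
        flags = [row[v] for v in order]
        # Once a variable is missing, every later variable must be missing too.
        if any(later > earlier for earlier, later in zip(flags, flags[1:])):
            return False
    return True


# Observed (1) / missing (0) indicators for (x1, x2, y) in the example.
category_1 = [(1, 0, 1), (1, 1, 1), (1, 1, 1), (1, 1, 0), (0, 1, 0), (0, 1, 1)]
category_2 = [(1, 1, 1)] * 3 + [(1, 1, 0)] * 3 + [(1, 0, 0)] * 2

print(is_monotone(category_1))  # False
print(is_monotone(category_2))  # True
```

Sorting by the number of observed cases is sufficient here because, in a nested pattern, a variable that is observed on fewer cases can only be observed where the more frequently observed variables are also present.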

Let us assume that the 3 by 3 variance covariance
matrix of the x's and y's is the same for both categories.
(The algorithm follows the same logic for the unequal $\Sigma_j$
case.) The problem is to estimate the means for the
variables for each category ($\mu_{1j}$, $\mu_{2j}$, $\mu_{yj}$, j = 1, 2) and the
common 3 by 3 variance covariance matrix, $\Sigma$. Given
estimates of these basic parameters, one can readily
estimate additional parameters. For example, since the
multiple correlation can be computed from the variance
covariance matrix, given an estimate of the latter, we can
estimate the former. In addition, given an estimate for $\Sigma$,
one can readily obtain the maximum likelihood estimates for
the regression weights in predicting y from $x_1$ and $x_2$.

If the data were complete, the maximum likelihood
estimates would be obtained by first computing the following
sums and sums of squares and cross products for the data in
category j, j = 1, 2:

$$\sum x_{1j},\ \sum x_{2j},\ \sum y_j,\quad \sum x_{1j}^2,\ \sum x_{2j}^2,\ \sum y_j^2,\quad \sum x_{1j} y_j,\ \sum x_{2j} y_j,\ \sum x_{1j} x_{2j}.$$

The traditional maximum likelihood estimates are then
obtained by computing the means from the sums, and
transforming the sums of squares and cross products into
variances and covariances.

When missing data are present, these statistics are
iteratively estimated by the EM algorithm. Consider as an
example the estimation of the sum of the y's in category 1.
Suppose initial estimates of the unknown parameters ($\mu_1$, $\mu_2$,
and $\Sigma$) are obtained from the complete data. For example,
the mean of y in category 1 is initially estimated to be
28/4 = 7. To estimate the sum of y in category 1, the
missing y scores of cases 4 and 5 are estimated. These
estimates are the expected values given the observed data
and the initial estimates. For subject 4, y is estimated
from the regression equation predicting y from $x_1$ and $x_2$.
These regression weights are in turn obtained from the
initial estimate for $\Sigma$. Similarly, for subject 5, y is
estimated from the regression equation predicting y from $x_2$.
Similar computations are followed to estimate all of the
sums and sums of squares and cross products. For example, to
estimate the sum of squares of y in category 1 ($\sum y_1^2$), the
missing values for $y_4^2$ and $y_5^2$ are replaced with their
expected values. In this case the estimates are given by
the sum of a squared predicted value and a residual
variance. Once all of the expected values are computed, the
usual maximum likelihood estimates are computed and used as
new parameter estimates. The expectation step is then
performed again. The process continues until the estimates
converge. As previously noted, computer programs are
available for performing this analysis.
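
The expectation and re-estimation steps described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the BMDP or FORTRAN programs mentioned in the text: it handles only one x variable that is observed for every case and a y variable that is missing for some cases, the function name `em_bivariate` is hypothetical, and the data are made up. Each missing y is replaced by its regression prediction, and each missing squared y by the squared prediction plus the residual variance, exactly as in the description of the sums and sums of squares.

```python
def em_bivariate(x, y, max_iter=5000, tol=1e-10):
    """EM estimates for a bivariate normal with x complete and y partly missing.

    Missing y values are given as None. Returns (mean_x, mean_y, var_x, var_y,
    cov_xy), all using the maximum likelihood divisor n.
    """
    n = len(x)
    # x has no missing values, so its ML mean and variance are computed directly.
    mean_x = sum(x) / n
    var_x = sum((xi - mean_x) ** 2 for xi in x) / n
    # Initial estimates for y from the available y scores; covariance starts at 0.
    y_obs = [yi for yi in y if yi is not None]
    mean_y = sum(y_obs) / len(y_obs)
    var_y = sum((yi - mean_y) ** 2 for yi in y_obs) / len(y_obs)
    cov_xy = 0.0
    for _ in range(max_iter):
        slope = cov_xy / var_x
        resid_var = var_y - slope * cov_xy
        # E-step: expected sums, sums of squares, and cross products.
        sum_y = sum_yy = sum_xy = 0.0
        for xi, yi in zip(x, y):
            if yi is None:
                pred = mean_y + slope * (xi - mean_x)
                sum_y += pred
                sum_yy += pred ** 2 + resid_var
                sum_xy += xi * pred
            else:
                sum_y += yi
                sum_yy += yi ** 2
                sum_xy += xi * yi
        # M-step: the usual ML formulas applied to the expected sums.
        new_mean_y = sum_y / n
        new_var_y = sum_yy / n - new_mean_y ** 2
        new_cov_xy = sum_xy / n - mean_x * new_mean_y
        change = max(abs(new_mean_y - mean_y), abs(new_var_y - var_y),
                     abs(new_cov_xy - cov_xy))
        mean_y, var_y, cov_xy = new_mean_y, new_var_y, new_cov_xy
        if change < tol:
            break
    return mean_x, mean_y, var_x, var_y, cov_xy


# Hypothetical data: y is observed only for the four highest x scores.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [None, None, None, None, 5, 7, 6, 9]
mean_x, mean_y, var_x, var_y, cov_xy = em_bivariate(x, y)
print(round(mean_y, 3), round(cov_xy, 3))
```

With several x variables and missing values on the predictors as well, the same two steps apply, but the regression used in the expectation step differs from case to case according to which variables that case has observed.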

It should be noted that in category 2, where the missing
data pattern is monotonic, the maximum likelihood estimates
can be obtained in a simple non-iterative manner without
using the EM algorithm. The estimates of the mean and
variance of $x_1$ can be directly computed since there are no
missing values for this variable. Secondly, two regression
equations and associated residual variances can be computed:
one predicting $x_2$ from $x_1$, and one predicting y from $x_1$
and $x_2$. The latter two
analyses are performed using the complete $x_1$, $x_2$ and the
complete $x_1$, $x_2$, y data sets respectively. The computed
statistics all provide maximum likelihood estimates under
the MAR assumption. Using these estimates, one can readily
compute the maximum likelihood estimates of any other
parameters which are expressible in terms of the originally
estimated parameters. For example, the estimate of the mean
of $x_2$ can be obtained by evaluating the regression equation
predicting $x_2$ from $x_1$ at the estimated mean for $x_1$.
Similarly, the formula for the population multiple
correlation of y with $x_1$ and $x_2$ can be expressed as a
function of the variance of $x_1$ and the regression weights
and residual variances for predicting $x_2$ from $x_1$ and y from
$x_1$ and $x_2$. Replacing these regression weights and variances
by their estimates yields the maximum likelihood estimate of
the multiple correlation. The basis for these simplified
non-iterative analyses when the missing data pattern is
monotonic is explained in Little & Rubin (1987). It is
interesting to note that if this logic is applied in
estimating the xy correlation for the very special monotonic
pattern which arises when there is only a single x variable
measured on all applicants but there are missing data on y,
the resulting estimate is the common restriction in range
correction formula. In general, one can estimate the xy
correlation when there are missing data on both x and y (a
non-monotonic pattern). Further, one can estimate the
multiple correlation in the case where there are several x
variables, and the missing data for the x's and y are not
monotonic. In these last two examples, the iterative EM
algorithm provides the general method for obtaining the
estimates. Simple formulas such as the correction formula
do not exist in these cases.
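
The equivalence noted above for the single-predictor case can be illustrated numerically. The Python sketch below is an illustration under stated assumptions rather than the authors' code: the function names are hypothetical, the data are made up, and the correction formula is taken in its familiar form r k / sqrt(1 - r^2 + r^2 k^2), where r is the correlation among the selected cases and k is the ratio of the unrestricted to the restricted standard deviation of x. The first function follows the non-iterative logic: the variance of x comes from all cases, and the regression of y on x comes from the complete cases.

```python
import math


def _complete_case_moments(x, y):
    """Variances and covariance (divisor m) from cases with y observed."""
    pairs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    m = len(pairs)
    mx = sum(xi for xi, _ in pairs) / m
    my = sum(yi for _, yi in pairs) / m
    sxx = sum((xi - mx) ** 2 for xi, _ in pairs) / m
    syy = sum((yi - my) ** 2 for _, yi in pairs) / m
    sxy = sum((xi - mx) * (yi - my) for xi, yi in pairs) / m
    return sxx, syy, sxy


def _all_case_variance(x):
    mean_x = sum(x) / len(x)
    return sum((xi - mean_x) ** 2 for xi in x) / len(x)


def factored_ml_correlation(x, y):
    """Non-iterative ML estimate of the xy correlation (x complete, y MAR)."""
    sxx_c, syy_c, sxy_c = _complete_case_moments(x, y)
    var_x = _all_case_variance(x)
    slope = sxy_c / sxx_c                    # regression of y on x
    resid_var = syy_c - slope * sxy_c        # residual variance
    var_y = resid_var + slope ** 2 * var_x   # implied population variance of y
    return slope * var_x / math.sqrt(var_x * var_y)


def range_restriction_corrected_r(x, y):
    """Restriction in range correction applied to the complete-case r."""
    sxx_c, syy_c, sxy_c = _complete_case_moments(x, y)
    r = sxy_c / math.sqrt(sxx_c * syy_c)
    k = math.sqrt(_all_case_variance(x) / sxx_c)
    return r * k / math.sqrt(1 - r ** 2 + r ** 2 * k ** 2)


# Hypothetical data: y is observed only for the four highest x scores.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [None, None, None, None, 5, 7, 6, 9]
print(factored_ml_correlation(x, y), range_restriction_corrected_r(x, y))
```

The two printed values agree up to floating point rounding, which is the sense in which the correction formula is a special case of the maximum likelihood approach.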

3. Application to Real Data

The model described above was applied in an 
investigation of the predictive validity of the selection 
process at a prestigious Northeastern secondary school where 
admission is based on the performance of applicants on a 
three part examination consisting of Mathematics, English, 
and Essay sections. Each of the applicants to the 
institution is first categorized into one of three admission 
categories based on background variables which include SES 
factors and educational history. Within each of these 
categories, the selection procedure involves two steps. 
First, applicant examinations are scored on both the 
Mathematics and English sections and those students whose 
scores fall below a determined cutoff are eliminated. The 
essays of the remaining applicants are then read and scored
and those applicants receiving a passing score on the Essay 
are invited to attend the school. It should be noted that 
different cutoff scores are used within each admission 
category. The criterion or y variable (second year grade 
point average, GPA) is measured for nearly all invited
applicants. Fewer than ten percent of the invited applicants
declined admission or dropped out prior to the
measurement of GPA. In summary, within each admission
category there are no missing data values for the
Mathematics and English examinations, and the missing data
pattern for the Essay and GPA variables is monotonic.

As noted above, the GPA variable is measured on over 90
percent of the cases who survived the second stage of
selection. More specifically, there were missing GPA scores
for a total of 18 individuals who were offered admission but
declined or dropped out prior to the time that GPA was
recorded. If it were not for these 18 cases, it is clear
that the missing data within each of the three admission
categories could be assumed to be MAR (Essay scores
missing as a function of Mathematics and English scores; GPA
missing as a function of Mathematics, English, and Essay
scores). In the analysis that follows, we in
fact treat these 18 cases as if they were MAR, i.e., as if
they were simply not admitted on the basis of the three
predictor variables. The possible bias introduced by this
assumption will be later investigated by performing a
sensitivity analysis where a range of values for the missing
scores are imputed and the parameter estimates recomputed.

The observed characteristics of the applicants (means,
standard deviations, group sizes) for the 1988-1989 school
year are shown in Table 1. The descriptive statistics are
reported separately for each admission category.
INSERT TABLE 1 ABOUT HERE



It is seen in Table 1 that every one of the 44
applicants in category 1 is accepted. The only missing data
occur on the GPA variable for a single student who declined
admission. It should be noted that even though all category
1 applicants were accepted for the 1988-1989 school year, it
was still of interest to consider the predictive validity of
the admissions tests since in subsequent years it is quite
possible that the selection ratio for category 1 may be less
than 1.00. In category 2, there were 1845 applicants of whom
only 365 survived the first selection stage and were
measured on the Essay, and from this latter group, only 151
were offered admission. Of these 151 cases, 140 entered the
school and were measured on GPA. In category 3, where there
were 594 applicants, 65 survived the first selection stage
and were measured on the Essay, and finally 40 were offered
admission. Of this latter group, only 34 were eventually
measured on GPA. In summary, across the three admission
categories, Mathematics and English scores were measured on
all 2483 applicants; Essay scores were measured on 474
members of this group; and GPA was measured for the 217
accepted applicants who entered and remained in the school.

In Table 2 the characteristics of the applicants who
were admitted and observed on the GPA variable are
presented. Thus, for all applicants in
Table 2, there are complete data for all variables.
An inspection of the means in Table 2 for the entrance
examinations shows that different admission standards were
employed for each category. It appears that the standards
were least stringent for category 1 and more stringent for
the other two categories. The
correlations presented in Table 2 are almost certainly
negatively biased estimates except for category 1 where
there basically are no missing data. The basis for this
conclusion is the well known result that by restricting the
range of the predictor variable, the associated correlation
is attenuated.

Table 3 contains the maximum likelihood estimates of
the following parameters: (a) the means for the Mathematics,
English, Essay, and GPA variables, (b) the multiple
correlations in predicting GPA from the three predictors,
and (c) the difference in the expected GPA of applicants
admitted by a lottery and those admitted on the basis of the
predictor variables. This latter parameter is referred to
as the "expected gain from selection" (EGS). In addition,
the associated standard errors of the estimates are
presented. It should be noted that the parameters (means
and multiple correlations) that are estimated
apply to the entire applicant pool. For example, the
estimate for the mean Essay score is an estimate of what
the average Essay score within a given category would have
been if all applicants were admitted. Except for category 1
(where nearly all applicants are accepted), the means in
Table 3 are uniformly lower than those in Table 2, where only
admitted applicants are considered.

INSERT TABLE 3 ABOUT HERE 

The EGS parameter presented in Table 3 is defined in
the following manner. If applicants within category j, j =
1, 2, 3, were selected by a lottery, the expected value of
their average GPA would be the population mean GPA for the
category, $E(\mathrm{GPA}_j)$. This expected value can
be obtained by evaluating the regression equation predicting
GPA from the three predictor variables (labeled for
convenience as 1, 2, 3) at the mean value for each of the
predictors:

$$E(\mathrm{GPA}_j) = \beta_{0j} + \beta_{1j}\mu_{1j} + \beta_{2j}\mu_{2j} + \beta_{3j}\mu_{3j}, \tag{1}$$

where $\mu_{1j}$, $\mu_{2j}$, $\mu_{3j}$ are the means for the Mathematics,
English, and Essay examinations for the population of
applicants in category j. Estimates of these means are
obtained from Table 3. Similarly, the expected GPA for
selected applicants, $E(\mathrm{GPA}_j \mid S)$, is obtained by evaluating
the category regression equation at the mean predictor
scores of selected applicants:

$$E(\mathrm{GPA}_j \mid S) = \beta_{0j} + \beta_{1j}E(x_{1j} \mid S) + \beta_{2j}E(x_{2j} \mid S) + \beta_{3j}E(x_{3j} \mid S), \tag{2}$$

where $E(x_{ij} \mid S)$ is the expected value of predictor i for the
selected cases in category j. Estimates of these expected
values can be obtained from Table 2.

The EGS parameter is then the difference in the
expected values given in equations (1) and (2):

$$\mathrm{EGS}_j = \beta_{1j}d_{1j} + \beta_{2j}d_{2j} + \beta_{3j}d_{3j}, \tag{3}$$

where $d_{ij} = \mu_{ij} - E(x_{ij} \mid S)$ is the difference in category j
between the mean applicant score and the mean accepted
applicant score for predictor i. Setting the values of the
$d_{ij}$ parameters to be the observed applicant-admitted
differences (obtained from Tables 1 and 2), and replacing
the population regression weights by their maximum
likelihood estimates, one obtains an estimate of $\mathrm{EGS}_j$,
j = 1, 2, 3. Since virtually all applicants in category 1 were
admitted, the EGS parameter was estimated only for
categories 2 and 3.
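
The arithmetic behind equations (1) through (3) can be sketched in Python as follows. This is an illustration only: the function names are hypothetical, and the intercept, regression weights, and means are made-up numbers, not values from Tables 1 to 3. The sketch evaluates the regression equation at the applicant means and at the selected-applicant means, and confirms that the weighted sum in equation (3) equals the difference between the two expected values because the intercept cancels.

```python
def expected_gpa(intercept, weights, predictor_means):
    """Evaluate a category regression equation at a set of predictor means."""
    return intercept + sum(b * m for b, m in zip(weights, predictor_means))


def egs_as_printed(weights, applicant_means, selected_means):
    """Equation (3) with d_i = applicant mean_i - selected mean_i."""
    return sum(b * (mu - sel)
               for b, mu, sel in zip(weights, applicant_means, selected_means))


# Hypothetical values for one category (Mathematics, English, Essay).
intercept = 40.0
weights = [0.2, 0.1, 0.5]
applicant_means = [50.0, 55.0, 4.0]
selected_means = [70.0, 72.0, 6.0]

lottery = expected_gpa(intercept, weights, applicant_means)      # equation (1)
selected = expected_gpa(intercept, weights, selected_means)      # equation (2)
egs = egs_as_printed(weights, applicant_means, selected_means)   # equation (3)
print(lottery, selected, egs)
```

With the differences defined as applicant mean minus selected mean, the computed value is negative whenever selected applicants score higher on positively weighted predictors. The gain of selection over a lottery is then the negative of this quantity, which appears to be the sense in which positive EGS values are reported; this reading is an assumption.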

Two major conclusions can be drawn from the estimates
in Table 3. First, the predictor variables are clearly
statistically valid predictors of GPA for categories one and
three. This result can be seen by testing the $R^2$ values for
significance using a simple z test, where z = $R^2$ / (standard
error of $R^2$). The observed $R^2$ values of .34 and .40 each
yield significant (p < .05) z values. Further evidence for
predictive validity in the third category is the finding
that the expected gain from selection is significantly
different from zero (z = 7.34/2.98 = 2.46, p < .05).
Second, although the results in Table 3 for the second
admission category fail to reach statistical significance,
they strongly suggest that the admissions variables are also
valid for this category. A z test of the estimated $R^2$ value
in category two (z = .14/.09 = 1.56) is nearly significant
(p < .06). Similarly, the estimated EGS value (1.90) for
category two also approaches significance (z = 1.90/1.39 =
1.37, p < .08). These findings may very well be
attributable to low levels of statistical power for the
significance tests of the $R^2$ and EGS parameters. This issue
is discussed more fully in the following section.

It is also of interest to consider the size of the 
standard errors and the associated confidence intervals for 
the estimates presented in Table 3. Although there is 
evidence for the predictive validity of the predictors in 
all three categories, the multiple correlation and EGS 
parameters cannot be precisely estimated. For example, the 
.95 confidence intervals for the squared multiple 
correlations in categories 1 and 3 are [.09, .59] and [.01, 
.79] respectively. Similarly, the .95 interval for the EGS 
in category 3 is [1.50, 13.18]. The width of the confidence 
intervals reflects not only the large amount of missing
data, but also the decision to analyze the data using the
heterogeneous $\Sigma_j$ model. Since far fewer parameters are
estimated under the homogeneous $\Sigma$ model, in the latter case
the standard errors and the widths of the associated
confidence intervals would certainly have been smaller. We
will return to this issue in the final section.
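
The z statistics and confidence intervals quoted above follow from the estimates and standard errors by simple arithmetic. The Python sketch below reproduces them using only numbers stated in the text; the function names are hypothetical, and the use of a one-tailed normal probability and a 1.96 critical value are assumptions about how the reported figures were obtained.

```python
import math


def z_test(estimate, standard_error):
    """Return (z, one-tailed p) for the hypothesis that the parameter is zero."""
    z = estimate / standard_error
    p = 0.5 * math.erfc(z / math.sqrt(2))  # upper tail of the standard normal
    return z, p


def confidence_interval(estimate, standard_error, z_crit=1.96):
    """Normal-theory interval: estimate plus or minus z_crit standard errors."""
    half_width = z_crit * standard_error
    return estimate - half_width, estimate + half_width


# EGS in category 3: estimate 7.34, standard error 2.98.
z3, p3 = z_test(7.34, 2.98)
low, high = confidence_interval(7.34, 2.98)
print(round(z3, 2), round(low, 2), round(high, 2))  # 2.46 1.5 13.18

# Squared multiple correlation in category 2: estimate .14, standard error .09.
z2, p2 = z_test(0.14, 0.09)
print(round(z2, 2))  # 1.56
```

The interval for the category 3 EGS matches the reported [1.50, 13.18], which is consistent with intervals of the form estimate plus or minus 1.96 standard errors.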

All of the maximum likelihood estimates given in Table 

3 were obtained under two assumptions: (a) the variance 
covariance matrices within each category are heterogeneous; 
(b) all of the missing GPA scores are MAR. We now consider 
the validity of each of these assumptions. The decision to 
present the estimates for the heterogeneous Zj model was 
based on the result of a log-likelihood ratio test. First 
the estimates of the category parameters (means, variances, 
covariances) were obtained under the assumption that the Zj 
matrices are heterogeneous. For this analysis, the EM 
algorithm was applied separately to the data in e?.~h 
category using the BMDP-AM computer program. Secondly, the 
parameters were reestimated under the assumption of a common 
variance covariance matrix. The Schlucter computer program 
was used for this analysis. Whereas the number of 
parameters estimated in the first analysis was 42 (4 means, 
4 variances, and 6 covariances within each of three 
categories), the second analysis estimated only 22 
parameters (4 means in each of the three categories, and the 
10 elements of the common variance covariance matrix). The 
null hypothesis of a common Σ matrix was tested by computing 
a chi square log-likelihood ratio test with 42 - 22 = 20 
degrees of freedom. This statistic was computed by taking 
minus twice the log of the ratio of the likelihoods of the 
data given the two sets of estimates. The resulting chi 
square was highly significant, leading to the rejection of 
the equal Σj model. 
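
The parameter counts and the form of the test can be sketched as follows. This is only an illustration: the two log-likelihood values are hypothetical placeholders (the maximized log-likelihoods are not reported here), and the function names are ours. The tail probability uses the closed form that holds for a chi square with an even number of degrees of freedom, which applies to the 20 degrees of freedom in this test.

```python
import math

def n_params_heterogeneous(n_vars, n_categories):
    """Separate means and a separate covariance matrix in every category."""
    cov_elements = n_vars * (n_vars + 1) // 2  # variances plus covariances
    return n_categories * (n_vars + cov_elements)

def n_params_homogeneous(n_vars, n_categories):
    """Separate means in every category, one common covariance matrix."""
    cov_elements = n_vars * (n_vars + 1) // 2
    return n_categories * n_vars + cov_elements

def lr_statistic(loglik_restricted, loglik_full):
    """Minus twice the log of the likelihood ratio (restricted over full)."""
    return -2.0 * (loglik_restricted - loglik_full)

def chi2_sf_even_df(x, df):
    """Upper-tail chi square probability; exact closed form for even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive even df")
    half = x / 2.0
    term = 1.0
    total = 1.0
    for i in range(1, df // 2):
        term *= half / i  # builds (x/2)**i / i!
        total += term
    return math.exp(-half) * total

# Four variables (Math, English, Essay, GPA) and three admission categories.
full = n_params_heterogeneous(4, 3)        # 42
restricted = n_params_homogeneous(4, 3)    # 22
df = full - restricted                     # 20

# HYPOTHETICAL log-likelihoods, for illustration only.
loglik_common = -5230.0
loglik_separate = -5200.0
statistic = lr_statistic(loglik_common, loglik_separate)
p_value = chi2_sf_even_df(statistic, df)
print(full, restricted, df, statistic, p_value)
```

With 20 degrees of freedom the .05 critical value is about 31.41, so any statistic well above that value leads to rejection of the common covariance hypothesis.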

Consider the second assumption that all of the missing 
GPA scores are MAR. As previously noted, this assumption is 
tenable for all but the 18 applicants who, although offered 
admission, either declined or left the school prior to the 
time that the GPA variable was measured. To investigate the 
possible effect of violations in the MAR assumption, a 
sensitivity analysis was performed where different values 
were imputed for these missing scores and the parameter 
values reestimated. When the MAR assumption holds, the 
regression equations predicting GPA (obtained from the 
complete data of the admitted applicants within each 
category) will provide unbiased estimates of the GPA scores 
of the 18 missing cases. However, when the MAR assumption 
does not hold, i.e. when these GPA scores are not missing 
simply as a function of the predictor variables, the 
previous equations will yield biased predictions (Gross, 
1987). To allow for the possibility that these 18 scores 
are not MAR, we imputed the missing values using the 
category regression equation together with an adjustment 
factor. More specifically, given an applicant from category 
j, with predictor scores x1, x2, x3, the predicted GPA was 
obtained using the complete case regression equation. This 
predicted value was then modified by adding an adjustment 
factor. This factor was either positive or negative one 
residual standard deviation unit. In other words, the 18 
missing values were estimated to be either systematically 
higher or lower than their predicted values. The squared 
multiple correlations and the EGS estimates for the three 
categories were then recomputed using the data set which 
included the imputed values. The results of these 
computations, presented in Table 4, suggest that the analysis 
is fairly robust with respect to the assumption that the 18 
missing GPA scores are MAR. The estimates of the squared 
multiple correlations and EGS values are substantially 
unchanged when different values (one residual standard 
deviation unit above and below the predicted value) are 
imputed for these missing scores. For example, in category 
3, the squared multiple correlation varies from the original 
maximum likelihood estimate of .40 by no more than .05 units 
under the two imputations. Similarly, the EGS values are 
changed by less than one point. 
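
The imputation rule used in the sensitivity analysis can be sketched as follows. The regression coefficients, residual standard deviation, and applicant scores below are hypothetical placeholders, since the fitted category equations are not reported here; only the rule itself (predicted value plus or minus one residual standard deviation unit) comes from the text.

```python
def predict_gpa(intercept, slopes, scores):
    """Predicted GPA from a category's complete case regression equation."""
    return intercept + sum(b * x for b, x in zip(slopes, scores))

def impute_shifted(intercept, slopes, scores, resid_sd, shift):
    """Predicted GPA moved by `shift` residual standard deviation units.

    shift = +1 or -1 gives the two sensitivity imputations;
    shift = 0 gives the prediction that is appropriate under MAR.
    """
    return predict_gpa(intercept, slopes, scores) + shift * resid_sd

# HYPOTHETICAL equation and applicant, for illustration only.
intercept = 80.0
slopes = (0.10, 0.05, 0.02)        # Math, English, Essay
resid_sd = 4.0
applicant = (50.0, 40.0, 100.0)    # x1, x2, x3

predicted = predict_gpa(intercept, slopes, applicant)
high = impute_shifted(intercept, slopes, applicant, resid_sd, +1)
low = impute_shifted(intercept, slopes, applicant, resid_sd, -1)
print(predicted, high, low)
```

After each of the two imputations is applied to all 18 cases, the squared multiple correlations and EGS values are recomputed and compared with the estimates obtained under MAR.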

4. Conclusions 

In analyzing any data set which contains missing data 
one must introduce assumptions concerning the missing data 
process to obtain statistically accurate estimates of the 
underlying parameters. The sensitivity of the analysis to 
these assumptions is clearly a function of the proportion of 
missing data. Violations of the underlying assumptions may 
be only a minor problem when relatively few data are 
missing, but can lead to highly biased estimates when 
the proportion of missing data is high. In investigating 
the predictive validity of a test battery for a highly 
selective institution, there will be by definition large 
amounts of missing data. The key assumption in this type of 
analysis is that the missing data can be accounted for in 
terms of the observed and measured variables, i.e., the 
missing data are missing at random. The proposed model 
represents an attempt to assure that this assumption will be 
satisfied by measuring not only the predictor variables, but 
also noting the admission category of the applicant. While 
the predictors alone cannot always account for the missing 
data, the measurement of the predictor variables together 
with the admission category may very well yield a data set 
where the missing data are missing at random. 

Although the proposed model is quite general and can be 
employed when there are missing data on the predictors, 
criterion and the admission category, the model is still 
potentially limited in two ways. As previously noted, the 
missing criterion scores of those applicants who decline an 
admissions offer or soon drop out, may not be MAR. In other 
words, there may be additional variables (statistically 
related to the criterion) which are responsible for these 
missing scores. For this problem, we have suggested that 
the investigator perform a sensitivity analysis where a 
range of reasonable values for these missing criterion 
scores is imputed and the parameters reestimated. For the 
data set considered in the present paper, the sensitivity 
analysis suggested that the model was robust with respect to 
the assumption that the missing criterion scores of admitted 
applicants were MAR. 

The second limitation of the proposed model is that 
although it yields unbiased estimates of the relevant 
parameters, both the precision of these estimates as well as 
the power of any significance tests based on these estimates 
may not be high due to the large amount of missing data as 
well as the form of the missing data patterns. The problem 
of precision was noted in terms of the rather wide 
confidence intervals obtained for the squared multiple 
correlations and the EGS parameter. The issue of low power 
is most clearly seen in the results for the second admission 
category where the results only approached significance. It 
can be argued that this finding is attributable to low 
levels of statistical power. More specifically, for the 
second admission category, complete data could be observed 
for only 140 of 1845 applicants. Further, within this 
complete case group, there was considerable restriction in 
the range of the predictor variables. The ratios of the 
standard deviations of each predictor variable in the 
selected group to the corresponding standard deviation for 
the applicant group were .52, .67, and .70 for the 
Mathematics, English, and Essay examinations, respectively. 
Although the complete sample size of 140 is not small, the 
high levels of range restriction within this sample can 
easily result in low power. 
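
The three ratios can be reproduced from the category 2 standard deviations in Tables 1 and 2. The short sketch below does only that arithmetic; the variable names are ours. Note that the applicant Essay standard deviation rests on the 365 category 2 applicants who were measured on the Essay.

```python
# Category 2 standard deviations for all applicants (Table 1).
applicant_sd = {"Mathematics": 12.21, "English": 8.08, "Essay": 30.52}

# Category 2 standard deviations for the 140 admitted applicants (Table 2).
admitted_sd = {"Mathematics": 6.36, "English": 5.41, "Essay": 21.41}

# Ratio of the selected-group SD to the applicant-group SD for each predictor.
ratios = {name: admitted_sd[name] / applicant_sd[name] for name in applicant_sd}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}")
# prints:
# Mathematics: 0.52
# English: 0.67
# Essay: 0.70
```

A ratio of .52 means that the admitted group shows only about half the spread in Mathematics scores found among all category 2 applicants.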

The problems of wide confidence intervals and low power 
in testing hypotheses are most likely to occur when the 
heterogeneous Σj model is employed. In this case, the 
parameter estimates are obtained using only the data from a 
single admission category at a time. It is clear, however, 
that the standard errors would be much smaller if the pooled 
or homogeneous Σj analysis were employed. Thus, the 
proposed model should be most useful when the data are 
consistent with the hypothesis of common variance covariance 
matrices across the admission categories. In addition to 
the choice of the heterogeneous or homogeneous Σj model, the 
precision of the estimates will be a function of factors 
such as the proportion of missing data, the number of 
admission categories, the overall sample sizes for each 
admission category, and the form of the missing data 
patterns, i.e., the degree of range restriction. It would 
be of practical value to provide some general guidelines in 
terms of these factors for identifying the data sets where 
the model can be expected to provide reasonably precise 
estimates. The construction of these guidelines would 
clearly be a useful area for future research. 

Table 1 

Observed Characteristics of Applicants 

Categ.  N applic a  N offer b  N attend c  Mean      Mean      Mean 
                                           Math      English   Essay 

1       44          44         43          51.14     42.91     84.52 
                                           [13.06]   [8.14]    [31.29] 
                                           (44)      (44)      (44) 

2       1845        151        140         39.42     37.42     91.45 
                                           [12.21]   [8.08]    [30.52] 
                                           (1845)    (1845)    (365) 

3       594         40         34          34.27     33.85     89.37 
                                           [12.01]   [7.53]    [32.59] 
                                           (594)     (594)     (65) 

Note. Standard deviations are in brackets, sample sizes are parenthesized. 

a N applic = number of applicants. 

b N offer = number of applicants offered admission. 

c N attend = number of applicants offered admission who attended and 
were measured on y. 
Table 2 

Observed Characteristics of Admitted Applicants 

Categ.      Mean      Mean      Mean      Mean      R2 a 
            Math      English   Essay     GPA 

1 (N=43)    50.67     42.88     85.05     86.?0     .32 
            [12.84]   [8.25]    [31.47]   [4.76] 

2 (N=140)   59.54     49.11     102.17    89.72     .06 
            [6.36]    [5.41]    [21.41]   [4.12] 

3 (N=34)    55.74     45.23     103.14    88.14     .15 
            [6.90]    [6.92]    [25.52]   [4.98] 

Note. Standard deviations are in brackets. 

a R2 = The squared multiple correlation predicting GPA from Math, English, 
and Essay for applicants measured on all variables. 
Table 3 

Maximum Likelihood Estimates and Standard Errors 

Category   Mean      Mean      Mean      Mean      EGS a     R2 b 
           Math      English   Essay     GPA 

1          51.14     42.91     84.52     86.62               .34 
           (2.36)    (1.34)    (5.22)    (0.75)              (.13) 

2          39.42     37.42     85.87     87.82     1.90      .14 
           (0.29)    (0.19)    (4.96)    (1.44)    (1.39)    (.09) 

3          34.27     33.85     58.01     80.80     7.34      .40 
           (0.53)    (0.33)    (14.19)   (3.16)    (2.98)    (.20) 

Note. Standard errors are in parentheses. 

a EGS = The difference in the expected value of GPA between applicants 
selected in terms of the x variables and those selected by a lottery. 

b R2 = The maximum likelihood estimate of the squared multiple correlation in 
predicting GPA from Math, English, and Essay. 
Table 4 

Sensitivity Analysis 

Category   R2 MAR a   EGS MAR b   R2 A c   R2 B d   EGS A e   EGS B f 

1          .34                    .34      .31 

2          .14        1.90        .12      .18      1.31      2.50 

3          .40        7.34        .37      .45      6.73      7.96 

a R2 MAR = Squared multiple correlation value assuming the "18 cases" 
are MAR. 

b EGS MAR = Expected gain from selection assuming the "18 cases" are MAR. 

c R2 A = Squared multiple correlation value where the 18 missing GPA's are 
imputed to be one residual standard deviation unit above the 
predicted value. 

d R2 B = Squared multiple correlation value where the 18 missing GPA's are 
imputed to be one residual standard deviation unit below the 
predicted value. 

e EGS A = Expected gain from selection where the 18 missing GPA's are imputed 
to be one residual standard deviation unit above the 
predicted value. 

f EGS B = Expected gain from selection where the 18 missing GPA's are imputed 
to be one residual standard deviation unit below the 
predicted value. 